HERA Memo: Creating Auto-Correlations with a Generative Adversarial Neural Network

Joseph C. Shy, 2021 CHAMP Scholar, 08/13/2021

Questions? Contact me: jshy@calpoly.edu or joeyshy883@gmail.com

NOTE: If running this notebook locally, please download all files from /users/jshy/mylustre/machine_learning/2459122-H4C_Machine-Learning_Practice if lustre is accessible. Note that some directory names will need to be changed before some functions are run again.

If access to lustre is not possible, the notebook will not be usable, as HERA auto-correlations are required.

GITHUB for project: (https://github.com/jshy883/jshy883-2459122-H4C_Machine-Learning_Practice-)

1. Introduction & Machine Learning Basics

1.1. Machine Learning applied to HERA

As the HERA radio antenna array continues to be developed, modified, and built upon, the challenge of distinguishing working antennas from their broken counterparts becomes a major priority before analysis can be performed with the data retrieved. Currently, the standard for the HERA collaboration is to visually assess auto-correlations that are returned from every observing antenna on a specific night of observation. Each assessment is performed manually by an operator and repeated by a handful of other operators for thoroughness and redundancy.

Below are examples of arbitrary PASS and NO PASS auto-correlations. Note the large spikes in the PASS measurement; these are caused by RFI channel interference. This presentation will assist in understanding the patterns operators must look for in good auto-correlations.

The end condition of the invalid auto-correlation differs markedly from that of the good auto-correlation.

This process can prove inconvenient, as it requires a large amount of focused time to gain confidence in the flag given to a certain antenna (the flag states its potential issue or whether it is cleared for use). Additionally, it is close to impossible to visually check every auto-correlation that is computed every night (typically > 100,000). This leads to situations where a good auto-correlation from an antenna flagged as broken is discarded because it was never screened. It is in the best interest of the collaboration to preserve as much data from each antenna as possible.

Machine learning, and more specifically, a generative adversarial neural network (or GAN), has the potential to automate and improve the current system for flagging antennas. The current working networks that will be focused on in the sections below are designed to be able to:

  1. Screen every auto-correlation produced.
  2. Flag an auto-correlation as PASS or NO PASS.

These two goals will be achieved through a system called a generative adversarial neural network. It is a machine-learning idea that trains a system to know exactly what a good-quality, PASS auto-correlation looks like. But rather than training the system on previously-identified NO PASS auto-correlations, it is trained on FAKE auto-correlations that deviate only slightly from REAL ones. The FAKE auto-correlations will be elaborated upon in the following sections. However, this choice of training on FAKE measurements that closely resemble REAL auto-correlations means the system can identify NO PASS auto-correlations only ambiguously. It will therefore be able to detect new errors in the antennas that have not been seen before. But this ambiguity also has a drawback: the system cannot define the exact problem with an auto-correlation, only the presence of one.

This system has the potential to greatly reduce the amount of time operators spend assessing auto-correlations, their role instead being only to classify the issues in the NO PASS auto-correlations. Additionally, the ability to screen every auto-correlation should increase the influx of usable data for analysis.

However, even more automation of this system could be achievable in the future. One potential route for future work would be to create another machine-learning architecture to feed the NO PASS measurements into. This system's goal would be to identify the exact problem with these auto-correlations. Please refer to Sec. 8.1 for more details.

1.2. Basics of a GAN

A generative adversarial neural network is a clever combination of two deep learning architectures. These deep learning architectures are "computing system(s) made up of a number of simple, highly interconnected processing elements, which process information by their dynamic state response to external inputs". The basic learning scheme for a neural network involves inputting a data set with specific flags (or classifiers) associated with each piece of data in the set (i.e., an image or plot) and allowing the processing elements within the neural network to update/learn in order to improve its ability to classify certain sets of data. Each pass in which a model updates its processing elements (or weights) is called an epoch. The magnitude by which the neural network updates its processing elements (or neurons) is based on the loss at each training epoch. This is the most basic application of neural networks, but they can be manipulated to do much more, which will be elaborated upon next.

The two learning architectures described within this memo are referred to as the detector model and the generator model. The general idea is to make the two models compete against one another. The generator's goal is to incrementally improve in its ability to create fake auto-correlations that look real. The detector's goal is to become better at discriminating between REAL and FAKE measurements.

The detector model operates similarly to the basic classification neural network described above; however, the classifications that the detector trains upon are REAL and FAKE. It receives an input training set of auto-correlations that are deemed good for analysis. Accompanying these input auto-correlations are classification flags (simply 1s arranged in a list the size of the auto-correlation training set) that communicate to the detector model that the incoming input values are REAL auto-correlations. Additionally, the detector receives FAKE auto-correlations generated from the generator model. At each epoch, the detector trains and improves in its ability to discriminate between REAL and FAKE auto-correlations.

However, the generator model trains as well, learning how to create FAKE auto-correlations that increasingly resemble good auto-correlations. This is accomplished by combining the generator and detector models into a larger, overarching GAN learning architecture. It is important to note that within the GAN model, the detector cannot be trained/updated. The input is a latent space, a vector of random numbers drawn from the standard normal distribution. The generator receives this latent space and performs a set of hidden mathematical operations on it, derived from the model's architecture and weights. The output of the generator is a FAKE auto-correlation, which is subsequently fed into the detector model with a REAL classification. This is where the generator training occurs: the detector will most likely return large losses, as it is being fed information contradicting its previous training outside the GAN architecture. The detector outputs a loss value, which the generator uses to update its own model weights in order to create FAKE auto-correlations that better trick the detector. Again, the contradictory information fed into the detector will not influence its own training and ability to discriminate between REAL and FAKE data, as within the GAN model the detector is restricted from updating its model weights.

1.3 Objectives

With the previous aspects of the project addressed, the goals of the project are driven toward a proof of concept: building a quality, practical GAN for HERA auto-correlations and successfully presenting its possible application to the HERA pipeline.

The objectives are as follows:

  1. Create a GAN architecture that trains with auto-correlations efficiently (in 10,000 epochs or less). This number is an upper limit, as preliminary GANs for this project came close to training in under 10,000 epochs; the objective is therefore to improve upon those preliminary training efficiencies.
  2. Create a useable detector that can screen auto-correlations from good antennas and broken antennas and successfully identify them as PASS (or REAL) and NO PASS (or FAKE).

2. Data Set Selection

When training a GAN, the selection of the training set is vital for streamlining training time and preventing training failure. With HERA's correlators constantly seeing improvement and modification, what is considered a good auto-correlation that validates an antenna's use can vary depending on the data set. The trend of the measurement over the frequency range and/or the magnitude of the power measured at each frequency may differ between sets. Additionally, each antenna produces auto-correlations of varying polarizations, which differentiate from one another as well.

As this study is to serve as a validation of GAN architecture for detecting and producing fake auto-correlations, it is important not to let the training set become too large and/or too complex. Showing the GAN auto-correlation sets that are all considered valid but differ significantly from one another will most likely result in much longer GAN training times or GAN training failure, as neural networks rely heavily on being able to distinguish overarching features and relationships shared across the auto-correlations in a training set.

Therefore, the down-selection to using only manually-validated "ee" polarized H4C auto-correlations on the observation night of 2459122 is made. This training set is used as a case study in order to learn/create a working GAN architecture that can delineate between real and fake auto-correlations and produce realistic auto-correlations from random input vectors, known as the latent space.

2.1. Creating the Data Sets

In order to train the GAN on what valid auto-correlations look like, there can only be valid data in the training set. Fortunately, the data mentioned above was chosen because it had been previously validated by members of the HERA collaboration and had more successful/valid antenna measurements than other data sets. The bad antennas within this set (listed under "ex_ants" here) were removed from the rest of the data.

Please reference the '*.txt' files within this current directory/Github. They store the filenames, filepaths, and antenna keys (both good and bad) for accessing auto-correlations on the observation night of 2459122.

The auto-correlation data is retrieved from Lustre through the filepath /lustre/aoc/projects/hera/H4C/2459122. All files within this directory are used, with the good "ee" polarized antennas being separated, organized, and saved locally (the code is omitted as it is not the focus of this study).

The auto-correlation data is concatenated and separated into '.npy' files for quick local access. However, these files are too large for Github storage (but can be found in /users/jshy/mylustre/machine_learning/2459122-H4C_Machine-Learning_Practice in the herapost-master node of ssh.aoc.nrao.edu).

Please use the download-training-data_joseph-shy_HERA-GAN notebook, within this directory/Github, if interested in how to access Lustre and create these numpy files for personal local use (HERA NRAO access is required).

The data is loaded in below.

NOTE: 11.25% of the valid auto-correlations are split into a separate set that will not be used for model training. This will be used as the validation set. 3.75% of the valid auto-correlations are to be used as a test set to evaluate the model after training. The remaining 85% of the valid auto-correlations will be used to train the models. These small fractions are used for validation and testing because the number of available auto-correlations is extremely large, so a small fraction still yields a significant test/validation set (see data below).
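The split described above can be sketched as follows. This is a minimal illustration, not the memo's actual code; the array name `autocorrs` is hypothetical, with one auto-correlation per row.

```python
import numpy as np

# Hedged sketch: split the valid auto-correlations into train/validation/test
# sets using the fractions quoted above (11.25% validation, 3.75% test,
# remainder for training). `autocorrs` is a hypothetical (n_samples, 1536) array.
rng = np.random.default_rng(0)

def split_data(autocorrs, val_frac=0.1125, test_frac=0.0375):
    n = len(autocorrs)
    idx = rng.permutation(n)                   # shuffle before splitting
    n_val = int(n * val_frac)
    n_test = int(n * test_frac)
    val = autocorrs[idx[:n_val]]
    test = autocorrs[idx[n_val:n_val + n_test]]
    train = autocorrs[idx[n_val + n_test:]]    # remaining ~85%
    return train, val, test
```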

Logistical data of the auto-correlation sets can be seen below.

3. Addressing Auto-Correlation Uncertainty

3.1. The Radiometer Equation

Theoretically, two auto-correlations retrieved from two consecutive integrations of the same radio antenna should be identical, as they are measurements of nearly identical patches of the sky. However, in practice, these auto-correlations will never be identical due to the noise introduced by certain measurement devices (i.e., receivers or amplifiers).

The expected noise distribution for the measurement can be understood with the Radiometer Equation. This equation quantifies the noise introduced by the measuring equipment from known properties of the measuring devices and experiment.

$\sigma_{T} = \frac{T_{sky}}{\sqrt{BW \cdot t}}$

$T_{sky}$ is the actual "sky temperature", or uncalibrated power, that would be measured in an ideal auto-correlation (without losses or noise). $BW$ is the integrated bandwidth of the auto-correlation. The integration time (the time over which the measurement is averaged from multiple shorter exposures) is $t$. Applied to the equation above, these values produce $\sigma_{T}$, the residual uncertainty in a sky temperature measurement.

3.2. Application to GAN Learning

Due to this random noise introduced by the measuring equipment, every auto-correlation in the training set is unique. To the GAN, this uniqueness appears random at surface level, as the neural networks only have access to the raw measurements in the training set. Therefore, the generator would not only have to learn the different, recurring auto-correlation trends and RFI channel patterns, but would also be forced to learn how to produce fake auto-correlations with random noise. This proved to be a difficult task for the generator early in model development, as a realistic fake auto-correlation could not be produced even after 10,000 epochs (iterations in which the GAN weights are updated and the model "learns").

Therefore, in order to assist the generator in training and shorten the training duration, the noise quantified by the Radiometer Equation was implemented into the models and training function. The motivation behind the use of a noise model is to relieve some of the features the generator is required to learn (ie. radiometer noise) and instead allow it to focus on learning idealized auto-correlation patterns.

The generator produces some power, or $T_{sky}$, at each frequency channel. Any time the generated (FAKE) measurement is input into or used to train the detector, a multiplicative random noise factor is applied to the sky temperature at each frequency. This noise factor is a random value drawn from a 1-centered Gaussian distribution with a standard deviation given by the Radiometer Equation.

$stddev = \frac{1}{\sqrt{BW \cdot t}}$

The process of calculating this standard deviation for an H4C auto-correlation on the night of 2459122 is shown below.
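As a hedged sketch of that calculation and of the multiplicative noise application, the snippet below uses placeholder values for the channel bandwidth and integration time; they are NOT the actual H4C numbers, which should be read from the data files.

```python
import numpy as np

# Placeholder values for illustration only; substitute the real H4C
# channel bandwidth and integration time from the observation metadata.
BW = 122e3   # bandwidth per channel in Hz (placeholder)
t = 9.6      # integration time in seconds (placeholder)

# Per-channel fractional uncertainty from the Radiometer Equation.
stddev = 1.0 / np.sqrt(BW * t)

def apply_radiometer_noise(fake_autocorr, stddev, rng=np.random.default_rng()):
    """Multiply each channel by a draw from a 1-centered Gaussian."""
    noise = rng.normal(loc=1.0, scale=stddev, size=fake_autocorr.shape)
    return fake_autocorr * noise
```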

4. Model Building

The neural network models were built through the heavy use of the Keras API. Keras is an accessible method of employing TensorFlow for machine-learning tasks.

Most, if not all, neural network structures are built with the following high-level components:

  1. Input Layer: receives input data set into the model
  2. Hidden Layers: intermediate computation is performed and information/weights are transferred to the next layer
  3. Output Layer: layer where final activation function occurs and maps to a desired output type
  4. Connections/Weights: constantly changing/updating values (during training) that transfer the output of a neuron to the input of a neuron in the next layer
  5. Activation Function: defines output of a layer; varies dependent on function type
  6. Learning Rule: algorithm that aims to optimize the model training or produce a favored output, usually by method of modifying weights

It is important to note that GANs are a relatively new idea in the machine-learning community, with the first paper on adversarial networks published in 2014. That being said, much of GAN development is motivated by previously proven techniques, with many model-architecture choices made because they worked in other machine-learning projects. At times, the chosen hyperparameters are best optimized through trial and error.

A few general standards are adopted from other experts in the machine-learning community. These standards are used to prevent a specific failure mode called mode collapse, which was seen in very early preliminary training of these models. The reference to mode collapse above also points to methods of preventing it. The two main guidelines are to use LeakyReLU activation with a slope of 0.2 for each hidden layer and to use the ADAM stochastic gradient descent learning rule with a learning rate of 0.0002 and a momentum of 0.5 as the model optimizer.

With these guidelines implemented universally, the GAN learns more quickly and stably than in earlier versions.
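In Keras, the two guidelines translate directly into a layer and an optimizer object. A minimal sketch:

```python
from tensorflow.keras.layers import LeakyReLU
from tensorflow.keras.optimizers import Adam

# LeakyReLU hidden-layer activation with slope 0.2, and the ADAM optimizer
# with learning rate 0.0002 and first-moment momentum (beta_1) of 0.5.
hidden_activation = LeakyReLU(alpha=0.2)
optimizer = Adam(learning_rate=0.0002, beta_1=0.5)
```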

Import the necessary functions/libraries for building a GAN with Keras. Ideally, tensorflow-GPU 2 is the version of tensorflow being used.
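The memo's actual import cell is not reproduced here, but a plausible set of imports for the models built in this section would look like the following sketch:

```python
# Hedged sketch of the imports a Keras GAN notebook of this kind needs.
import numpy as np
from tensorflow import keras
from tensorflow.keras.models import Sequential, Model
from tensorflow.keras.layers import (Input, Dense, Conv1D, Conv1DTranspose,
                                     AveragePooling1D, Flatten, Reshape,
                                     LeakyReLU, GaussianNoise, Multiply)
from tensorflow.keras.optimizers import Adam
```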

NOTE: Future work could include adding batch normalization and dropout layers.

4.1. Inputs Required to Build a Model

In order to build a model with Keras, a couple of inputs must be understood.

One input is the batch size, which is the number of training examples the model trains with in one epoch. The batch size is left as a constant 128, regardless of detector or GAN training. It is standard practice in much of machine learning to keep the batch size a power of 2. Typically, batch sizes are lower than the one implemented here, in order to prevent overfitting of the detector or generator model, but the choice of a slightly larger batch size was made because the large H4C training set consists of all unique measurements, making overfitting less likely.

The other input is the input shape of a single example in the batch that is to be sent to the input layer of the neural network.

4.2. The Detector Model

The detector model is built sequentially, meaning that each layer is stacked atop one another. Therefore, there is only one input tensor and output tensor for each layer, and the output of a layer above is the input to the layer below.

The detector model is built with an initial three convolutional layer blocks (with a built-in input layer within the first block), three subsequent fully-connected layers (or Dense layers), and a final fully-connected layer that serves as the output layer.

The convolutional layers (Conv1D) are the first set of hidden layers. Their purpose is to extract features from the auto-correlation plots. The initial convolutional layers are meant to extract low-level features, such as minute changes in slope across the frequency channels of the auto-correlation, while the later layers are meant to detect higher-level features, such as common peaks and troughs in the auto-correlation plots. Combining these layers allows the detector to build a complete picture of what REAL and FAKE auto-correlations should look like at any scale.

The differences between these layers are the number of filters/features detected per pixel (or frequency channel, in the case of auto-correlations) and the kernel size. The kernel size is the number of adjacent pixels that contribute to the calculation of an output feature. A larger kernel size means more information contributes to each detected feature, which is why larger kernel sizes are associated with higher-level features. It is standard practice in convolutional networks to increase the feature count and kernel size the deeper the network gets.

Other parameters of the convolutional layers include strides and padding. Padding is assigned as "same" in order to preserve and account for the end conditions of the auto-correlations, which carry information equally important to judging an auto-correlation's validity; other padding methods do not weigh end conditions as significantly. Lastly, the stride size, the number of pixels (or frequency channels) that the kernel moves between feature detections, is left as 1. Striding is a form of compression of the input data set.

The stride size is left as 1 due to the use of external pooling layers. Pooling, applied after convolutional layers, down-samples the feature map output to a lower-resolution version that still contains the important features used to identify the object (or auto-correlation, in this case). Down-sampling helps prevent overfitting by delocalizing common features, allowing the detector to learn to search for a feature in a region of channels rather than a single one; this helps the detector when it sees auto-correlations it was not exposed to in training. The stride parameter is the factor by which the input is down-sampled (2 in this case).

Two common pooling options are Average Pooling (which down-samples a patch to its average) and Maximum Pooling (which down-samples a patch to its maximum value). AveragePooling1D is used over MaxPooling1D, as it has been seen in the machine-learning community to yield more stable and convergent GAN training results. Padding is left as "same" for the reason mentioned previously. Staggering the pooling layers to occur after every 2-3 convolutional layers has also seen success in other convolutional neural networks; pooling at every layer would compress the feature map too much, causing less dependable results from the deeper layers (as they would be fed extremely compressed inputs).
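A quick shape check illustrates this down-sampling; the feature count of 8 here is arbitrary, chosen only for the demonstration.

```python
import numpy as np
from tensorflow.keras.layers import AveragePooling1D

# AveragePooling1D with stride 2 and "same" padding halves the channel
# dimension while leaving the feature dimension untouched.
x = np.ones((1, 1536, 8), dtype="float32")   # (batch, channels, features)
pooled = AveragePooling1D(pool_size=2, strides=2, padding="same")(x)
# pooled has shape (1, 768, 8): 1536 channels down-sampled to 768.
```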

The Dense layers that follow are not standard for a typical convolutional-neural-network detector model in a GAN. However, research showed that a combination of fully-connected and convolutional layers can result in quicker and more stable GAN training. The paper making this claim can be found here. As per the referenced research paper, three intermediate fully-connected layers following the convolutional layers should be implemented. On a high level, these fully-connected layers operate by performing various matrix multiplications on all the output from the previous layer. These computational matrices update over time, as their weights are trained with the GAN. The number of output neurons from the layer is specified during model building; that specification determines the number of matrices being created and updated through model training.

The Flatten layer is a utility used to flatten the two-dimensional output of the convolutional layers (the down-sampled auto-correlation and its feature map) into one dimension. This is done in order to preserve all the data from the convolutional layers, as the subsequent Dense layers only process the last dimension of an input tensor. All the data in the output tensor from the convolutional layers is important and must be carried through to the end of the model, so flattening is a necessity.

The last layer (or output layer) is a Dense layer with a single output neuron. It uses the sigmoid activation function, which can only produce values between 0 and 1. Recall that the purpose of the detector is to be a binary classifier, with classifications REAL and FAKE. These flags are converted to 1 and 0, respectively, so the neural networks can interpret them. The detector is therefore required to output only values between 0 and 1, which is exactly what sigmoid activation provides.

4.2.1. Compiling the Detector Model

Model compilation is necessary for any model that is to be trained on directly.

Binary Cross-Entropy is used as the loss function for the detector, as the purpose of the detector is to discriminate between two classes (REAL or FAKE/1 or 0). The returned loss value is "log loss". More can be read about this loss function here. It is a common standard for most basic unsupervised, binary GANs.

The ADAM optimizer with a learning rate of 0.0002 and momentum of 0.5 is used per reasoning given in the introduction of Sec. 4.

Lastly, the accuracy metric is compiled within the detector model. Specifying this metric allows for a better understanding of the detector's performance during training/evaluation. The output will be between 0.0 and 1.0: the ratio of correctly identified inputs to the total number of inputs in the training or validation batch.
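Taken together, Sec. 4.2.1 amounts to a single compile call. In this sketch, `detector` is a trivial stand-in model so the snippet runs on its own; the real detector from Sec. 4.2.2 would be compiled identically.

```python
from tensorflow.keras import Input
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.optimizers import Adam

# Trivial stand-in model; substitute the real detector from Sec. 4.2.2.
detector = Sequential([Input(shape=(1536,)), Dense(1, activation="sigmoid")])

# Binary cross-entropy loss, ADAM with lr=0.0002 / beta_1=0.5, and the
# accuracy metric, per Sec. 4.2.1.
detector.compile(loss="binary_crossentropy",
                 optimizer=Adam(learning_rate=0.0002, beta_1=0.5),
                 metrics=["accuracy"])
```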

4.2.2. Building and Visualizing the Detector Model

The detector model is to be defined with a batch size of 128 and an input dimension of 1536 (the number of frequency channels). It is important to note that the input shape into the detector is (1536,1), in order to be compatible with the shape parameter requirements of the Conv1D layers.
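A minimal sketch of a detector along the lines described above follows. The filter counts, kernel sizes, and Dense-layer widths are illustrative guesses, not the memo's exact values; only the overall structure (growing Conv1D blocks with "same" padding and stride 1, staggered AveragePooling1D, Flatten, three Dense layers, sigmoid output) follows the text.

```python
from tensorflow.keras import Input
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import (Conv1D, AveragePooling1D, Flatten,
                                     Dense, LeakyReLU)

def define_detector(n_inputs=1536):
    """Sketch of the detector; hyperparameter values are illustrative."""
    model = Sequential([
        Input(shape=(n_inputs, 1)),
        # Convolutional blocks: filters and kernel size grow with depth.
        Conv1D(16, 3, strides=1, padding="same"), LeakyReLU(alpha=0.2),
        Conv1D(32, 5, strides=1, padding="same"), LeakyReLU(alpha=0.2),
        AveragePooling1D(pool_size=2, strides=2, padding="same"),
        Conv1D(64, 7, strides=1, padding="same"), LeakyReLU(alpha=0.2),
        AveragePooling1D(pool_size=2, strides=2, padding="same"),
        # Flatten so the Dense layers see every feature.
        Flatten(),
        Dense(256), LeakyReLU(alpha=0.2),
        Dense(64), LeakyReLU(alpha=0.2),
        Dense(16), LeakyReLU(alpha=0.2),
        # Single sigmoid neuron: REAL (1) vs FAKE (0).
        Dense(1, activation="sigmoid"),
    ])
    return model
```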

The detector model is now properly built and compiled for training.

4.3 The Generator Model

The generator model is also a sequentially built model. It is another fully-connected and convolutional neural network, as mentioned in this paper. However, this auto-correlation generator takes a slightly simplified approach, in order to reduce computational cost and remove unnecessary layers (some initial dense layers).

The generator model begins by taking an input latent space in one dimension. This latent space is an assortment of random values drawn from the standard normal distribution (Sec. 1.2) of a user-specified length. The objective of this neural-network generator (once trained) is to produce a realistic auto-correlation from a normal-random noise vector (or latent space). This realistic auto-correlation, once the GAN's architecture, hyperparameters, and training are perfected, should be able to fool those at HERA who visually confirm the validity of auto-correlations (at least, that is the objective).

The first processing layer is a fully-connected Dense layer. It remaps the latent space to a certain desired amount of inputs (n_inputs).

In order to fit the logical structure of the architecture, the n_inputs value must be divisible by 64, as the next layer reshapes the input tensor into two dimensions of (n_inputs/64, 64).

The tensor is reshaped in order to fit the input structure of the following transposed convolutional layers (or deconvolutions). The input shape follows the same form as the output of the convolutions in the detector: the first dimension is a compressed auto-correlation and the second is a feature map. The deconvolutions (represented by the Keras function Conv1DTranspose) perform the inverse operation of a convolution/pooling combination, meaning the feature/filter count follows the opposite flow as well: the number of features starts large and decreases the deeper the network gets. Simultaneously, the dimension that represents the channels of the fake auto-correlation increases with an inverse stride of 2 at each layer.

The hyperparameters of the Reshape and Conv1DTranspose architecture are specifically designed for an output tensor of (1,1536) from the first Dense layer, as the result of the last deconvolution is a shape of (1536,1), which is the defining shape of the auto-correlation tensor in the detector model. Therefore, n_inputs is meant to be 1536. The tensor is then flattened to be compatible with the next Dense layer.

The last layer is a Dense layer that maps an input of 1536 to 1536. Although this may seem unintuitive or computationally heavy (it requires $1536^2$ parameters), it is implemented to account for the reproduction of RFI channels. The RFI channels, typically large spikes on a single frequency channel, proved a difficult task for exclusively deconvolutional layers to reproduce in preliminary network designs. The deconvolutions have the potential to miss extreme trends in the data that do not span multiple frequency channels, as they hinge on the relationship between adjacent channels to produce a quality result. Although computationally more expensive than a deconvolution, the final Dense layer seems to solve this problem, most likely serving as the layer that injects most, if not all, RFI channels into the output fake auto-correlation (however, this cannot be proven due to the "hidden" nature of neural-network decision-making). The layer can configure itself to greatly increase the magnitude of a single output channel through its matrix multiplication.

The last layer uses LeakyReLU activation, as the generator must be versatile in its ability to output auto-correlations. The generator is meant to produce auto-correlations on the same scale as the training set (typically of order $10^0$ and positive once normalized). This activation function can produce values within the range of the REAL auto-correlations with ease. It can also produce small negative numbers if necessary (this sometimes occurs when logarithmic normalization is used: see Sec. 5.1.1).

The output is a (1536,1) tensor and is meant to increasingly resemble a plotted H4C auto-correlation as the model is trained.

4.3.1. Compiling the Generator Model

The generator model is not compiled exclusively, as it is not trained on its own. In the next section, Sec. 4.4, the GAN is built. The generator is compiled within the GAN, as it trains based on the performance of the detector when the detector is fed FAKE auto-correlations flagged as REAL (see Sec. 5.2 for the training process). The generator does not use its own loss/accuracy results to train and therefore does not require exclusive compilation.

4.3.2. Building and Visualizing the Generator Model

The generator model reuses the same batch size as the detector for consistency. The input shape into the generator model is the length of the desired latent space. In this case, it is 22, motivated by a UC Berkeley graduate student, Christian Bye (who provided advice and direction for this project). In his own machine-learning project, he used 22 latent dimensions and achieved success, so that value was carried over to this project. A more principled choice of latent-space size would require much more in-depth analysis and could be an area of further work to perfect GAN training. The shape of the input latent space is (1,22), so that the Dense layers can perform computations on the latent values. The other input, n_inputs, represents the number of neurons to map to for two of the Dense layers in the generator architecture. This input is designed to reflect the number of desired frequency channels in the output, which is 1536.
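A minimal sketch of a generator along the lines of Secs. 4.3 follows. The filter counts and kernel size are illustrative guesses, and for simplicity the latent input here is a flat vector of length 22 rather than shape (1,22); only the overall structure (Dense up-mapping, Reshape to (n_inputs/64, 64), Conv1DTranspose layers with stride 2 and shrinking filter counts, and a final Dense 1536-to-1536 layer with LeakyReLU) follows the text.

```python
from tensorflow.keras import Input
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import (Dense, Reshape, Conv1DTranspose,
                                     Flatten, LeakyReLU)

def define_generator(latent_dim=22, n_inputs=1536):
    """Sketch of the generator; hyperparameter values are illustrative."""
    # n_inputs must be divisible by 64 for the Reshape below to be valid.
    assert n_inputs % 64 == 0
    model = Sequential()
    model.add(Input(shape=(latent_dim,)))
    # Map the latent vector up to n_inputs values.
    model.add(Dense(n_inputs))
    model.add(LeakyReLU(alpha=0.2))
    # Reshape to (compressed channels, features) = (24, 64) for n_inputs=1536.
    model.add(Reshape((n_inputs // 64, 64)))
    # Deconvolutions: the feature count shrinks while the channel dimension
    # doubles at each stride-2 layer: 24 -> 48 -> 96 -> ... -> 1536.
    for n_filters in (32, 16, 8, 4, 2, 1):
        model.add(Conv1DTranspose(n_filters, kernel_size=5, strides=2,
                                  padding="same"))
        model.add(LeakyReLU(alpha=0.2))
    # Final Dense 1536 -> 1536 layer (intended to help inject single-channel
    # RFI spikes), with LeakyReLU activation.
    model.add(Flatten())
    model.add(Dense(n_inputs))
    model.add(LeakyReLU(alpha=0.2))
    model.add(Reshape((n_inputs, 1)))
    return model
```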

The generator model is now built and ready for compilation within the GAN.

4.4 The GAN Model

The GAN model architecture differs from the generator and detector models in two very different ways:

  1. The model is not sequential. It is built as a functional API.
  2. The model contains other models (the generator and detector) within it.

The GAN model is an overarching combination of both the generator and detector models. One input takes the shape of the generator's input, because the generator model is one of the initial layers in the GAN model pipeline. However, there are actually two initial layers. This is what defines a functional API model in Keras: it is not sequentially built, meaning the output tensor of one layer is not automatically the input tensor of the next. Instead, the input tensor must be specified. With this functionality, the outputs of layers can be saved and called in later layers non-consecutively. This capability enables the model to apply 1-centered Gaussian noise (with a standard deviation based on the Radiometer Equation) to the generator's fake auto-correlation output (see Sec. 3). The output from the Gaussian noise layer and the output from the generator model are combined through the Multiply layer, which multiplies two input tensors into one output tensor. The output of this layer should resemble an auto-correlation with realistic noise at each measured frequency channel. This tensor is inserted into the detector model, whose output is a binary classification.

It is important to note that the detector is flagged as untrainable. This will come into play when discussing the training of the GAN.

4.4.1. Compiling the GAN Model

The GAN model is compiled for training the same way the detector model is. This is because the output of the detector is the output of the GAN model, meaning binary cross-entropy is again the best loss function. The optimizer remains ADAM, as it has proven to be a viable optimizer for GANs. However, the accuracy metric is not necessary, as the input auto-correlations into the GAN's detector model will be falsely flagged (see Sec. 5.2 for more details); the metric would therefore not give useful results.

4.4.2. Building and Visualizing the GAN Model

The pre-built generator and detector models are two necessary inputs for building the GAN; they are called within the GAN model. The GAN reuses a batch size of 128 for consistency. The latent_dim is equivalent to that of the generator model, as the GAN uses the same generator parameters as when it was built by itself. n_inputs is 1536, as in the previous models; it is required to define the noise vector that is factored into the generator output. Lastly, stddev is the standard deviation calculated in Sec. 3.2 and is required for the noise application within this model.
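The functional-API assembly described above can be sketched roughly as below. The two-input structure, GaussianNoise/Multiply layers, frozen detector, and binary cross-entropy/ADAM compilation follow the text; the exact define_gan() signature and the stddev default are assumptions (the true stddev comes from the Radiometer Equation in Sec. 3.2).

```python
from tensorflow.keras.layers import Input, GaussianNoise, Multiply
from tensorflow.keras.models import Model

def define_gan(generator, detector, latent_dim=22, n_inputs=1536, stddev=0.01):
    """Hypothetical sketch of the GAN: generator -> multiplicative
    one-centred noise -> frozen detector."""
    # Flag the detector as untrainable so only the generator updates
    # when the GAN is trained.
    detector.trainable = False

    # Two initial input layers: the latent vector and a ones-vector
    # that GaussianNoise perturbs around one.
    latent_in = Input(shape=(1, latent_dim))
    ones_in = Input(shape=(n_inputs, 1))

    fake_auto = generator(latent_in)
    noise = GaussianNoise(stddev)(ones_in)

    # Multiply applies the one-centred noise channel-by-channel to the
    # generator's fake auto-correlation.
    noisy_fake = Multiply()([fake_auto, noise])
    verdict = detector(noisy_fake)

    gan = Model([latent_in, ones_in], verdict)
    gan.compile(loss="binary_crossentropy", optimizer="adam")
    return gan
```

At inference time, feeding a ones-vector through GaussianNoise yields the identity (Keras only injects noise during training), which keeps generated snapshots clean.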

All inputs into define_gan() are defined within cells above.

The GAN is now built and ready for compilation for training.

5. Training the Models

This section contains all functions and calls required for training the GAN, detector, and generator models. It also includes all the pre-processing functions required for preparing REAL and FAKE data for training.

5.1 Pre-processing and Preparing Data Sets

The following functions are to be called within the larger train() function, which will be described further below. These functions normalize and organize both the REAL and FAKE data. They also create the classification vectors that are fed into the training function in order to specify what the input training data is (REAL or FAKE).

5.1.1. Generate REAL Training Batches

The function below serves three purposes:

  1. Concatenate a batch of REAL auto-correlations (from data input) of a desired batch_size
  2. Normalize the data within the batch by a desired normalization technique (norm_tech)
  3. Create a REAL classification array of length batch_size

Normalization pre-processing of the training set is important for the auto-correlation data. In some auto-correlation data, depending on the current configuration of HERA's correlator, RFI channel magnitudes may differ significantly from the rest of the auto-correlation. As seen in preliminary drafts of these models, neural networks had difficulty training on these intense magnitude spikes. It is possible for these primarily convolutional neural networks to train on such large spikes, but doing so took significantly longer, which would slow the development and improvement of network architectures and hyperparameters. Normalization solves this problem by decreasing the scale of the training auto-correlations.

Three types of normalization options are possible with the function below:

  1. Median Normalization: normalize by dividing data by the auto-correlation median
  2. Natural Logarithmic Normalization: normalize by taking the natural logarithm of the auto-correlations (NOTE: This technique is omitted from analysis for interest of memo length. But it is an area of future research.)
  3. Fixed-Point Normalization: normalize by dividing data by a fixed input

Once the data is normalized and organized into a batch, it is input into the detector for training.

A comparison of normalization techniques will be given later within this report.
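The three normalization options above can be expressed in a few lines of numpy. The function name, the norm_tech labels, and the default fixed point (the 5e6 V² value used later in Sec. 6.2) are assumptions mirroring the text, not the memo's verbatim code.

```python
import numpy as np

def normalize_batch(batch, norm_tech="median", fixed_point=5e6):
    """Normalize a (batch_size, n_channels) array of auto-correlations."""
    batch = np.asarray(batch, dtype=float)
    if norm_tech == "median":
        # 1. Median normalization: divide each auto-correlation by its
        #    own median.
        return batch / np.median(batch, axis=1, keepdims=True)
    if norm_tech == "log":
        # 2. Natural logarithmic normalization (omitted from the memo's
        #    analysis, but listed as future work).
        return np.log(batch)
    if norm_tech == "fixed":
        # 3. Fixed-point normalization: divide by a fixed input value.
        return batch / fixed_point
    raise ValueError(f"unknown norm_tech: {norm_tech}")
```

Whichever technique is chosen must then be applied identically to the training, validation, and test sets, as noted below.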

5.1.2. Organize Validation Data Set

This function normalizes the validation set that was created in Sec. 2.1. It can also be used to normalize the test set (or any set of the same format). The normalization techniques are the same as in the function above. The same technique must be used on the training, validation, and testing sets associated with a given model training run, or results will be inconsistent.

The validation set is key in analyzing the training of the model. It allows insight into whether the models are being overfit to the current training set. The validation set is kept separate from the training set and is never used for model training. This means the detector model will never be influenced by the validation data. In theory, when comparing the losses/accuracies that the detector returns for the validation and training sets at each epoch, they should be equivalent. This is because both sets represent the same classification and are randomly selected.

In the case that the validation set and training set return performance data that significantly differ from one another, the model training process may have a flaw. The most common flaw is overfitting to the training set. This means the detector model over-trains on the training set and loses its ability to detect a good, random auto-correlation; it may have trained itself to identify a trend only present in the training set that is not indicative of any REAL auto-correlation. This shows up as the detection accuracy for the validation set decreasing while the accuracy for the training set increases or remains the same over many epochs. Validation sets are essential for analyzing the usability of GANs.

5.1.3. Generate FAKE Data Batches

The generate_latent() function below creates a latent space array of a certain length (latent_dim) and certain size (batch_size). The latent space has its values drawn from the standard normal distribution (see Sec. 1.2). It is used as the input batch into the generator when producing FAKE auto-correlations.

The generate_fake() function inputs a latent space of a certain batch_size into the generator, which then outputs its FAKE data. It creates as many auto-correlations as the defined batch_size. These auto-correlations may look realistic, depending on how far the models are into training. It also returns a FAKE classification flag for each generated auto-correlation. These outputs are used in training the detector.
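A minimal numpy sketch of these two helpers, under the assumptions that the latent space is drawn from the standard normal distribution with shape (batch_size, 1, latent_dim) and that FAKE flags are zeros; the generator argument can be any callable (a Keras model, or a stand-in for testing).

```python
import numpy as np

def generate_latent(latent_dim, batch_size):
    # Draw latent values from the standard normal distribution
    # (see Sec. 1.2); shape matches the generator's (1, latent_dim) input.
    return np.random.standard_normal((batch_size, 1, latent_dim))

def generate_fake(generator, latent_dim, batch_size):
    # Feed a latent batch through the generator and pair each output
    # with a FAKE classification flag (0).
    latent = generate_latent(latent_dim, batch_size)
    fakes = generator(latent)
    flags = np.zeros((batch_size, 1))
    return fakes, flags
```

During training, these FAKE batches are fed to the detector alongside REAL batches from generate_real-style helpers.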

5.2 Training Function

5.2.1. Training Methodology

GANs require a specific training scheme in order to train both the detector and generator models properly. The breakdown is as follows.

The detector is trained exclusively (as it is untrainable within the GAN model); that is why it is given the ability to be compiled by itself. At each epoch, the detector is fed a batch of REAL auto-correlations with REAL flags and a batch of FAKE auto-correlations with FAKE flags. The detector trains using the Keras train_on_batch function, which allows manual training of a model at each epoch, rather than a function that automates the process and trains for multiple epochs in one call (as the fit function does). This is a necessity: after each epoch, the generator will be capable of producing more realistic FAKE auto-correlations, so the training function must be iterable in order to insert better FAKE batches at each epoch. The detector therefore gets increasingly better at detecting FAKE auto-correlations that increasingly resemble REAL auto-correlations as training proceeds. Note that the detector is trained on the REAL and FAKE batches separately; trial and error showed this trains faster than combining them.

The generator is trained through the GAN model. The GAN model uses train_on_batch as well, for the same reason the detector does. The training is not as straightforward as the detector training. The input batch for training the GAN consists of a latent space of a specific batch size, along with a list of REAL flags of the same batch size. The latent space is fed into the GAN architecture, meaning the values reach the generator first. The generator produces a FAKE auto-correlation, which is multiplied by the Radiometer noise (see Sec. 3). This new FAKE auto-correlation, with noise accounted for, is input into the detector architecture. The detector, which is untrainable in this architecture, outputs a classification (0 or 1) for what it thinks each FAKE auto-correlation is (either REAL or FAKE). The loss is then calculated from the comparison between the detector's classifications and the REAL flags associated with the generator's auto-correlations. This loss value is what drives the updates to the generator's weights. The generator trains on FAKE data that is labelled as REAL, because the generator's goal is to create auto-correlations that trick the detector. So, for example, if the loss value is high for the GAN training, the generator did not do well in tricking the detector (a large loss points to disagreement between the input flags and the detector's output classifications). If the loss value is low, the generator did well in tricking the detector and does not need to update its weights as much.

This dual-training leads to a zero-sum game. The generator is constantly improving, as it uses the detector's performance to update itself. The detector is constantly improving, as it is fed more and more realistic FAKE data. The result should be a generator that can produce realistic auto-correlations that can trick the manual observer and a detector that can detect even the slightest deviation from a proper auto-correlation (in order to detect possible, unseen malfunctions in antennas).
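The alternating scheme above can be demonstrated end-to-end with toy stand-in models and random "REAL" data. Everything here (layer sizes, epoch count, the random data) is illustrative; only the epoch structure (separate REAL/FAKE detector updates via train_on_batch, then a GAN update with FAKE latent input deliberately flagged REAL) reflects the memo's methodology.

```python
import numpy as np
from tensorflow.keras import Sequential, Model
from tensorflow.keras.layers import Input, Dense

latent_dim, n_chan, batch = 4, 16, 8

# Tiny stand-in detector, compiled alone so it can train on its own.
detector = Sequential([Input(shape=(n_chan,)),
                       Dense(8, activation="relu"),
                       Dense(1, activation="sigmoid")])
detector.compile(loss="binary_crossentropy", optimizer="adam")

# Tiny stand-in generator (no noise layer, for brevity).
generator = Sequential([Input(shape=(latent_dim,)), Dense(n_chan)])

# Freeze the detector inside the GAN; its standalone compile above
# still lets it train when called directly.
detector.trainable = False
z_in = Input(shape=(latent_dim,))
gan = Model(z_in, detector(generator(z_in)))
gan.compile(loss="binary_crossentropy", optimizer="adam")

for epoch in range(3):
    real = np.random.rand(batch, n_chan)              # stand-in REAL batch
    z = np.random.standard_normal((batch, latent_dim))
    fake = generator.predict(z, verbose=0)
    # Detector: separate updates on REAL (flag 1) and FAKE (flag 0) batches.
    detector.train_on_batch(real, np.ones((batch, 1)))
    detector.train_on_batch(fake, np.zeros((batch, 1)))
    # Generator (via the GAN): FAKE latent input deliberately flagged REAL,
    # so a low loss means the generator fooled the detector.
    g_loss = gan.train_on_batch(z, np.ones((batch, 1)))
```

The compile-before-freeze ordering is the standard Keras GAN recipe: trainable flags are snapshotted at compile time, so the detector updates when trained directly but stays fixed inside the GAN.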

5.2.2. Training Function Breakdown

The training function not only trains the models, but it also shows its progress during training and saves/presents performance data.

All data pertaining to the training of a GAN is saved to the final_training-analysis directory, where different runs can be found (labelled by the normalization technique used: 1 for median normalization and 2 for fixed-point normalization).

All data stored within has a README describing it.

6. Model Training

Below are the function calls and variable assignments for training various GAN models. The training differs depending on the pre-processing normalization technique applied to the training/validation sets. Differing techniques are used to compare training efficiency (by seeing how well each model trains and tests after a set 10,000 epochs of training). The 10,000-epoch count is chosen arbitrarily but kept consistent to compare models. The same analysis (i.e., testing whether the model detects bad auto-correlations as FAKE data) will be done on each model to help determine which normalization technique trained the model most efficiently and reliably.

Median normalization is the main method of pre-processing for this model training (as it is the most standard). It will use the previously defined models from Sec. 4. Fixed-point normalization will also be explored in training, in order to probe possible training-efficiency improvements. Additionally, natural logarithmic normalization is a possibility (though it would require minor modifications to some model architecture when implementing Radiometer noise), but it is omitted from this report in the interest of length. It could be explored in the future.

6.1. Median Normalization Training

The following training is for median normalization method of pre-processing the training and validation auto-correlations. The directory model_ID-1 will pertain to the training data and performance of the median training run.

Call the training function.

Due to an error in the notebook save file, the epoch progress plots were not shown inline with the notebook. So the final result and performance analysis plots will be called from their respective directories. Please refer to training-snapshots and training-performance directories for more plots and data.

6.1.1. Post-Training Summary

The training for median normalization of the REAL training set is complete.

But analyzing GAN training quantitatively and robustly can prove difficult. There are developing strategies (such as the inception score) used to quantitatively analyze the performance/success of a GAN. However, the timeline for this project (10 weeks as of writing) did not allow ample time to adapt and implement such a method.

A common and simple standard of GAN performance analysis is visual inspection of generated images. Viewing the epoch snapshots, by epoch 10,000 the generator (blue) produces very realistic, normalized plots. As expected from preliminary analysis, the RFI channels (especially around 100 MHz) do not stay confined to single channels, as is standard in REAL auto-correlations. The RFI spikes spill over into nearby channels, giving them greater magnitudes than expected (see the example of a REAL auto-correlation in Sec. 1.1). Additionally, the generated plot still shows some inconsistent "noise" or "bumps" in its larger trends. For example, the peak between 50 to 75 MHz should be almost perfectly smooth (aside from the RFI channel at about 65 MHz), but the generated image still shows smaller fluctuations that deviate from the expected smoother trend.

The performance plots presented at the end of the run do raise mild concern. There are rather large and unexpected spikes in both detector and GAN losses during training. The cause of this issue is currently unknown and can be an area of future research. However, when observing the accuracy plots of the detector, the validation and training results remained along the same trend and did not increasingly deviate as epochs progressed. This result suggests that the models are, at least, not over-fitted, and could mean more training should/can occur to possibly improve the models.

Regardless, the GAN showed general success as a means of proving the concept of machine learning for HERA auto-correlations. The generated plot looks extremely similar (ignoring the issues mentioned above) to REAL normalized auto-correlations. The model trained efficiently, showing resemblance to an auto-correlation as early as epoch 4,000. So the architecture and hyperparameters are proving to be well suited to HERA-type auto-correlation data (though improvement can still be made).

6.2. Fixed-Point Normalization Training

The fixed-point normalization technique is used to pre-process the data for the training run below. The fixed value used is 5 × 10⁶ V². This value was picked arbitrarily from simple observation of the scale of a REAL auto-correlation, which typically lies within the order of magnitude of 10⁶–10⁷ V². This training method will be compared with the performance of the other training techniques used.

Call the training function.

Unfortunately, the notebook lost its direct connection during code execution, meaning epochs 5,000-10,000 were not displayed inline with their snapshots. However, the model still trained to completion (as the code still executed remotely), so the result is a fully trained model. Below are the epoch 10,000 snapshot and performance plots.

6.2.1. Post-Training Summary

The training for fixed-point normalization of the REAL training set is complete.

When viewing the final epoch result from the generator, it seems as though the generator was less successful than in the first training run. However, this may be a case in which (at this specific epoch) the generator attempted a different weight preset to improve model learning. If so, looking back at older epochs may give insight into whether the model was on track to succeed. However, when viewing previous epoch snapshots (specifically 7,000-9,000), the model seems to be on a non-convergent training path. Epoch 7,000 seems promising, but the generator struggles (even more than in median-normalized training) to produce proper RFI spikes. Additionally, the general trends in the generated results are littered with minor, small fluctuations (see epoch 9,000) that do not smooth out as training progresses.

The performance plots raise concern. The loss spikes, across all models, are greater in magnitude than in the median-normalized training run. This is cause for concern, as mentioned when summarizing the previous run, and should be investigated. However, the accuracies for training and validation do not diverge from one another, suggesting that no over-fitting has occurred. Additionally, it seems the detector model improved in its ability to detect REAL auto-correlations in the final epochs (9,000-10,000). Although this should not occur when training a GAN (the detector should always be slightly confused when presented with REAL data), it could prove beneficial for the purpose of delineating between PASS and NO PASS data.

Comparison between the two training models will be further investigated, as their performances did not show any major concerns when compared to one another (however, both do behave strangely; i.e., the loss spikes).

7. GAN Analysis

7.1 Generator Analysis

There are not many quantitative methods to analyze the generator other than visual observation. But visual observation should prove sufficient for assessing an efficient generator (one of the goals of the project) that trained successfully in 10,000 epochs. Below are random generated auto-correlations from the final generator models. They are denormalized by the inverse of their respective techniques (NOTE: for median normalization, the standard median was seen to be ~ 1e7).

When comparing the above plots to the good auto-correlation from the introduction to this report, it is quite obvious that the median-normalized generator had greater success in creating realistic FAKE auto-correlations. The fixed-point-normalized generator is oddly creating a lot of minute noise rather than smooth signal. This could be a red flag, signaling that the detector may have decided that smooth signal was in fact indicative of a FAKE auto-correlation, causing the generator to add random noise to its output to fool it. Potential over-training or another error in model training or data normalization may have occurred.

This will be noted when analyzing the fixed-point-normalized detector.

7.2. Detector Analysis

Recall the goal of this study: create a useable detector model for HERA use.

However, testing the detector model and presenting its use in the HERA pipeline may be more difficult than originally expected. Unknown before the project began, when training the detector on REAL and FAKE data, it often becomes confused in its ability to identify REAL data (often returning accuracies between 60-100% at random between training epochs). This can be seen best in the accuracy plots from the training section. It is shown again below with use of a testing set (similar to the validation set but smaller in size).

It can be seen above that the median-normalized detector has difficulty identifying good auto-correlations from the testing set, while the fixed-point-normalized detector seems to perform better. However, this example is not conclusive, as it uses a small dataset (of roughly 1,000 auto-correlations) and does not include any bad data from broken antennas.

The detectors have not seen bad auto-correlations before. But the entire purpose of this experiment is to make the crossover between GAN learning and the HERA antenna classification pipeline. In order to be usable, the detector must discriminate between good and bad antennas by their auto-correlations. Usability is a goal of the project and is directly tested below.

The detector will be given a rigorous test. It will be fed 10 sets of auto-correlations from good antennas and 10 from bad antennas. These antennas were determined good or bad here. The detector will see auto-correlations from the entire night of observation (meaning every auto-correlation from a specified antenna). However, it is important to note that some antennas are "bad" to varying degrees of certainty: some show obviously dead or inconsistent signals, while others show only small deviations. A mix is used in the test batch of 10 antennas. This matters because some antennas may not be NO PASS for the entire night and may be flagged as REAL at some times within the observational period. This is expected and may even point to the possible, accidental removal of quality data from HERA analysis, as some auto-correlations from a manually flagged NO PASS antenna may be of quality.

The detector is tested with the predict functionality within Keras. It outputs a class prediction for REAL or FAKE (but when applied to the HERA pipeline: PASS or NO PASS). However, the prediction is not binary, but is presented as a binary-probability classification between 0 and 1. It outputs a probability, as the detector can make classifications at varying levels of confidence dependent on the input auto-correlations. But values closer to 0 represent a NO PASS prediction, and values closer to 1 represent a PASS prediction.

The figure presented is a point-density plot. The horizontal axis is integration time or LST (depending on the subplot), and the vertical axis is the binary-probability classification for the auto-correlation at the associated point on the horizontal axis. All antenna classifications are overlaid onto their respective panels (known good antennas or known bad antennas) and plotted as a point-density plot. This process is performed with this function and this colormap. Regions where points overlap or cluster (meaning two or more antennas had similar classifications) appear redder, while sparser regions appear bluer.

Viewing the plots above, many features of the detector's performance can be noted. First, and most importantly, the detector detects a higher density of NO PASS data from the known set of bad antennas than from the set of good antennas. This can be seen from the difference in colors near the 0-classification when cross-comparing the good and bad antennas. This result is very exciting, as it offers validation of this experiment's intent of applying GAN machine learning to data-quality processing. These results show that this case study has real potential to become a usable tool in the future of HERA.

Regardless of that success, however, there are certainly some oddities within these results. First is the glaring issue of there still being a much higher density of PASS classifications than NO PASS classifications in the bad antenna data. Additionally, there seems to be a temporal pattern within the detections, with the bad antennas being flagged as PASS later in the evening of observation. This may be due to a certain body in the sky (i.e., the galactic center) significantly increasing (or perhaps decreasing) the magnitude of the measurements during this period. This change may either break the normalization scheme or cause a certain feature to arise that tricks the detector into thinking it is a quality measurement. The trend can even be seen in the good-antenna plots, where the intensity of confident PASS classifications increases significantly in the same region. Another possibility is that the data in these time periods may actually be usable but were removed along with the bulk of the bad data associated with that antenna. But this explanation is less likely due to the correlation across good and bad antennas, which points to a possible sky anomaly or bright source deceiving the detector.

Additionally, the good antennas are not declared PASS with extreme confidence. It can be seen that the detector mislabels auto-correlations or is confused throughout the observational night and regardless of LST. This can possibly be credited to the nature in which GANs are trained (mentioned in beginning of this section), where it becomes confused on REAL measurements deep into training. Or, this could be credited to there being a few measurements from the good antennas that are actually bad.

But the results above are a general success. Good antennas are more often flagged as PASS than not, and the detector has some success in flagging bad antennas. It is likely that, if this detector is ever improved/implemented, detection will work on a density-of-classification basis. That is, if an antenna output has > X% of its probability classifications lower than some binary-probability classification Y, the antenna should be manually checked to determine the issue.
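The proposed density-of-classification rule could be implemented as a simple post-processing step on the detector's per-integration probabilities. The function name, thresholds, and example probability lists below are all illustrative assumptions, not tuned values from this study.

```python
import numpy as np

def needs_manual_check(probs, prob_limit=0.5, frac_limit=0.25):
    """Flag an antenna for manual inspection when more than frac_limit
    of its binary-probability classifications fall below prob_limit."""
    probs = np.asarray(probs, dtype=float)
    frac_low = np.mean(probs < prob_limit)   # fraction of NO PASS-leaning outputs
    return bool(frac_low > frac_limit)

# Hypothetical per-integration detector probabilities for two antennas.
good_antenna = [0.9, 0.8, 0.95, 0.7, 0.6]
bad_antenna = [0.1, 0.2, 0.9, 0.05, 0.3]
```

In practice, the per-integration probabilities would come from the Keras predict output described above, and prob_limit/frac_limit (the Y and X% of the rule) would need to be tuned against manually flagged antennas.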

Perform the same analysis for fixed-point normalization.

Fixed-point analysis ended with a very unusual result. It seems, for the most part, that the detector never properly trained on the correct features of a REAL auto-correlation; the overwhelming NO PASS classification is indicative of this. One possibility is that the detector over-trained to the training set, causing it to lose understanding of what a random, REAL auto-correlation looks like. Alternatively, the fixed-point normalization is not robust enough or requires a different fixed point: depending on the magnitude of the features in the auto-correlation, the features could change significantly from measurement to measurement. There do seem to be slightly more PASS classifications from the known good antennas, pointing to very minor success. But all in all, it seems this normalization failed and requires more investigation.

8. Conclusion

Applying generative adversarial neural network (GAN) learning to HERA auto-correlations is no simple task. The difficulty lies in choosing a proper training set (as not all auto-correlations have been screened), building a neural network architecture that lends itself to identifying both larger trends and extreme magnitude spikes, and refining a normalization scheme that does not cause loss of the valuable trends/patterns from which the network derives essential features. The project proved a successful proof of concept, with the creation of a fully-connected-convolutional GAN trained on normalized auto-correlation data. The median-normalized models (which typically divided the auto-correlation by a factor of ~ 1e7) developed the ability to create close-to-realistic auto-correlations, and they showed minor success in making the leap from detecting REAL vs. FAKE to PASS vs. NO PASS data. In analysis, the detector could pick out some broken antennas from the larger group. There is a lot of potential for this project in the future, and with refinement, it could become a useful tool for the HERA collaboration.

8.1. Future Work

There are many possibilities and avenues for future development of this model in order to refine its ability to detect and create auto-correlations. The following tasks are listed in order of significance:

  1. Refine/rework the normalization technique in order to prevent loss of imperative auto-correlation features (try natural logarithmic normalization).
  2. Implement early stopping and a learning-rate scheduler in order to train the model for longer without worry of over-fitting.
  3. Implement batch normalization and dropout to improve training performance.
  4. Experiment with new architecture and activation functions in models.
  5. Optimize latent space noise vector.
  6. Investigate the possibility of a second deep-learning neural network that identifies errors in auto-correlations (and points to antenna problems) after they are flagged as NO PASS by the detector.